Business Goal - Moving from Preventive to Predictive Maintenance

Is there a competitive advantage in how businesses manage their key assets? Predicting failures before they happen has a direct impact on the bottom line: unplanned downtime is expensive, while planned maintenance lets the business schedule parts and labor in advance.

On the other hand, predictions are not perfect; acting on a wrong prediction has its own cost.

In this analysis, I assign costs to these prediction scenarios in order to find an “optimal” solution that works for the business.

Description of Dataset

This dataset is the Kaggle version of the well-known public dataset for asset degradation modeling from NASA (C-MAPSS). We are given data for 100 simulated jet engines, with 21 sensor readings per cycle, contaminated with considerable noise.

The goal is to predict the “Remaining Useful Life” given sensor readings for a particular engine cycle (possibly equivalent to flights or days).

The dataset granularity is engine (unit_number) and cycle. Thus, one engine could have many records, each record pertaining to a specific cycle.

Each engine in the training dataset is run-to-failure. Thus we can precisely calculate the RUL for every cycle. However, the validation dataset has engines that are observed at a specific point in time; some are still running, and some have failed.

My Approach

Step 1: Simplify the problem for a quick win. Instead of predicting RUL, can we predict engines that have fewer than 30 cycles remaining before failure? This turns a regression problem into a classification problem. I came to this conclusion after exploring the data and seeing a clear demarcation between sensor values on either side of this boundary. Presumably, 30 cycles would be enough time to secure the parts needed for maintenance; this is where the business would need to weigh in. Certainly, the more lead time the better, but that also comes with less confidence in the prediction - it’s a balance.
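The reframing itself is one line: label any cycle whose RUL is at or below the 30-cycle threshold as a “fail soon” case. A minimal sketch with made-up RUL values (not this dataset):

```r
# Reframe regression as classification: 1 = fewer than ~30 cycles remain.
# The rul values below are illustrative only.
rul <- c(120, 85, 42, 30, 12, 3)
fail_in_30 <- as.integer(rul <= 30)
fail_in_30
# 0 0 0 1 1 1
```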

I used two different classification algorithms: Logistic Regression and eXtreme Gradient Boosting (XGBoost). Each has a different optimization method and may lead to different results depending on the data.

Step 2: Once we have learnings from Step 1, predict the RUL itself, which requires accounting for the sequential component of our data. This calls for more advanced neural network architectures such as LSTM (Long Short-Term Memory). I will tackle this in another script.

The dataset can be found here:
https://www.kaggle.com/datasets/behrad3d/nasa-cmaps
https://www.nasa.gov/content/prognostics-center-of-excellence-data-set-repository

Conclusions

First, the majority of the training data represented “no failure” cycles, as expected. Most machine learning algorithms work best when the number of observations in each class (failures vs. no failures) is about the same; otherwise, we could get very high accuracy just by predicting the majority class. Thus, I downsampled the majority class to approximately match the number of failures.

Second, the algorithms I used provide a probability of failure. This is where I introduced the business cost of an unplanned failure (false negative) vs. prematurely predicting failure (false positive). These costs can radically change the outcome, so they need to be chosen carefully.

Third, I incorporated survival curves from the statistical field of “survival analysis”, because these curves are a powerful way to share the big picture with business customers. They are actually models in their own right, but I am using them descriptively rather than predictively, to visualize engine deterioration over time.


Results


For the machine learning algorithms, the metrics we care about are Sensitivity and Specificity.

Sensitivity: When the engines actually failed within 30 cycles, how often was the prediction “yes”?

Specificity: When the engines did not fail within 30 cycles, how often was the prediction “no”?
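These two metrics fall directly out of the confusion-matrix counts. A sketch with hypothetical prediction vectors (1 = fail within 30 cycles):

```r
# Hypothetical actual and predicted labels, for illustration only.
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 0, 0, 0, 0, 1, 0)

tp <- sum(predicted == 1 & actual == 1)  # true positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives (missed failures)
tn <- sum(predicted == 0 & actual == 0)  # true negatives
fp <- sum(predicted == 1 & actual == 0)  # false positives (premature maintenance)

sensitivity <- tp / (tp + fn)  # here: 2/3
specificity <- tn / (tn + fp)  # here: 4/5
```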

Compare the cost of doing nothing (run to failure) with the cost of predicting and preventing failures using machine learning. A disproportionate cost (penalty) is applied to each false negative. The units are not important here; we are looking at the relative cost between these different approaches. The machine learning approaches offer significant savings.
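A minimal sketch of that comparison, with made-up unit costs and error counts (the actual costs and counts come from the analysis itself):

```r
# Illustrative unit costs: an unplanned failure (false negative) is far more
# expensive than premature maintenance (false positive). Counts are hypothetical.
cost_fn <- 100   # cost of an unplanned, run-to-failure event
cost_fp <- 10    # cost of unnecessarily early maintenance

n_failures <- 30                          # failures in the evaluation window
do_nothing_cost <- n_failures * cost_fn   # run to failure: every failure unplanned

fn <- 3; fp <- 12                         # hypothetical model errors
ml_cost <- fn * cost_fn + fp * cost_fp

c(do_nothing = do_nothing_cost, ml = ml_cost)
# do_nothing = 3000, ml = 420
```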

1. Preparation
1.1 Load Libraries and packages
# Package names
packages <- c("tidyverse",  # wrangling
              "corrplot",   # correlation plot
              "ggplot2" ,   # plots
              "parsnip",    # wrapper package for modeling
              "recipes",    # Preprocessing & Sampling
              "rsample",    # Preprocessing & Sampling
              "rpart.plot", # plot simple Decision Tree
              "tidyquant",  # theme for visuals
              "ROCR"  ,     # needed for ROC curve
              "themis" ,    # downsample
              "survival",   # survival analysis
              "survminer",  # survival plot
              "broom",      # tidy() output for survival models
              "xgboost",    # algorithm
              "dplyr",      # wrangling
              "correlationfunnel", # correlations, automatic binning
              "plotly",     # visualization
              "caret" ,     # confusion matrix
              "kableExtra"  # Data Formatting
              )

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
invisible(lapply(packages, library, character.only = TRUE))
1.2 Column Headers
identifiers <- c('unit_number', 'time_cycles')
setting_names <- c('setting_1', 'setting_2', 'setting_3')
sensor_names <- paste0('sensor_', 1:21)
col_names <- c(identifiers, setting_names, sensor_names)
1.3 Import Data, attach column headings
file_path  <- "data/prevMtce/RUL_NASA_Engines"
file_train <- "train_FD001.txt"
file_test  <- "test_FD001.txt"
file_RUL   <- "RUL_FD001.txt"

train_raw  <- read_delim(paste0(file_path,'/',file_train), delim = " ", col_names = col_names, show_col_types = FALSE)
valid_raw   <- read_delim(paste0(file_path,'/',file_test), delim = " ", col_names = col_names, show_col_types = FALSE)
valid_answer_raw    <- read_delim(paste0(file_path,'/',file_RUL), delim = " ", col_names = "rul_last", show_col_types = FALSE)


train_raw_tbl <- tibble(train_raw)
valid_raw_tbl  <- tibble(valid_raw)
valid_answer_raw_tbl   <- tibble(valid_answer_raw)

# remove extraneous cols, add new helper cols
train_tbl <- train_raw_tbl %>% 
    select(everything(), -starts_with("X") ) %>% 
    mutate(count_me = 1
           ,time_cycles_norm = time_cycles
)
    
valid_tbl <- valid_raw_tbl %>% 
    select(everything(), -starts_with("X") ) %>% 
    mutate(count_me = 1
           ,time_cycles_norm = time_cycles)

# add row number as "unit number" as the original data was just the RUL with no identifier
valid_answer_tbl <- valid_answer_raw_tbl %>% 
    select(everything(), -starts_with("X") ) %>% 
    mutate(unit_number = row_number())
1.4 For Training, calculate the RUL
Use the maximum cycles to create the RUL
train_rul_tbl <- train_tbl %>% 
    group_by(unit_number) %>%
    mutate(max_cycles = max(time_cycles),
           rul = max_cycles - time_cycles
           ) %>% 
    ungroup()
1.5 For validation, calculate the RUL for the cycles we have been given.
For example, if we are given 50 cycles for unit1, and the RUL is 120 at the 50th cycle, the RUL at the 49th cycle is 121, the 48th cycle is 122, etc.
valid_max_tbl <- valid_tbl %>% 
    group_by(unit_number) %>%
    mutate(max_cycles = max(time_cycles)
           ) %>% 
    ungroup() %>% 
    # join to valid_answer_tbl
    left_join(valid_answer_tbl) %>% 
    # calculate rul 
   mutate(
       rul = rul_last + (max_cycles - time_cycles)
       ) %>% 
    select(-rul_last)
1.6 Create Train dataset with unique unit number and max cycles
We will use this in some of the exploratory plots
train_max_cycles <- train_rul_tbl  %>% 
select(  unit_number
        ,max_cycles
      ) %>% 
    distinct() 
1.7 Create Valid dataset with unique unit number and max cycles
We will use this in some of the exploratory plots
valid_max_cycles <- valid_max_tbl  %>% 
select(  unit_number
        ,max_cycles
      ) %>% 
    distinct() 
1.8 Check for Null values - None!
paste("The number of nulls in train:",sum(is.na(train_rul_tbl)), ", the number of nulls in validation:", sum(is.na(valid_max_tbl)) )
## [1] "The number of nulls in train: 0 , the number of nulls in validation: 0"
1.9 Shape of Data
## [1] "There are 20631 rows, and 30 columns in train_rul_tbl"
## [1] "There are 13096 rows, and 30 columns in valid_max_tbl"
1.10 Glimpse of Data
All of the data is numerical. Later, we will scale and center the data to avoid larger values influencing the algorithms.
glimpse(train_rul_tbl) 
## Rows: 20,631
## Columns: 30
## $ unit_number      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ time_cycles      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ setting_1        <dbl> -0.0007, 0.0019, -0.0043, 0.0007, -0.0019, -0.0043, 0~
## $ setting_2        <dbl> -4e-04, -3e-04, 3e-04, 0e+00, -2e-04, -1e-04, 1e-04, ~
## $ setting_3        <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100~
## $ sensor_1         <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.6~
## $ sensor_2         <dbl> 641.82, 642.15, 642.35, 642.35, 642.37, 642.10, 642.4~
## $ sensor_3         <dbl> 1589.70, 1591.82, 1587.99, 1582.79, 1582.85, 1584.47,~
## $ sensor_4         <dbl> 1400.60, 1403.14, 1404.20, 1401.87, 1406.22, 1398.37,~
## $ sensor_5         <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.6~
## $ sensor_6         <dbl> 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.6~
## $ sensor_7         <dbl> 554.36, 553.75, 554.26, 554.45, 554.00, 554.67, 554.3~
## $ sensor_8         <dbl> 2388.06, 2388.04, 2388.08, 2388.11, 2388.06, 2388.02,~
## $ sensor_9         <dbl> 9046.19, 9044.07, 9052.94, 9049.48, 9055.15, 9049.68,~
## $ sensor_10        <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3~
## $ sensor_11        <dbl> 47.47, 47.49, 47.27, 47.13, 47.28, 47.16, 47.36, 47.2~
## $ sensor_12        <dbl> 521.66, 522.28, 522.42, 522.86, 522.19, 521.68, 522.3~
## $ sensor_13        <dbl> 2388.02, 2388.07, 2388.03, 2388.08, 2388.04, 2388.03,~
## $ sensor_14        <dbl> 8138.62, 8131.49, 8133.23, 8133.83, 8133.80, 8132.85,~
## $ sensor_15        <dbl> 8.4195, 8.4318, 8.4178, 8.3682, 8.4294, 8.4108, 8.397~
## $ sensor_16        <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03,~
## $ sensor_17        <dbl> 392, 392, 390, 392, 393, 391, 392, 391, 392, 393, 392~
## $ sensor_18        <dbl> 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388,~
## $ sensor_19        <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100~
## $ sensor_20        <dbl> 39.06, 39.00, 38.95, 38.88, 38.90, 38.98, 39.10, 38.9~
## $ sensor_21        <dbl> 23.4190, 23.4236, 23.3442, 23.3739, 23.4044, 23.3669,~
## $ count_me         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ time_cycles_norm <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ max_cycles       <dbl> 192, 192, 192, 192, 192, 192, 192, 192, 192, 192, 192~
## $ rul              <dbl> 191, 190, 189, 188, 187, 186, 185, 184, 183, 182, 181~
2.0 Exploratory Data Analysis (EDA)
2.1 Create a Histogram Function
create_histogram <- function(data, main_title="title", xlab="xlab", breaks=20) {
hist(data
     , main = main_title
     , xlab = xlab
     , breaks = breaks)

# Calculate the average and median
average <- mean(data)
average_str <- as.character(average)
median <- median(data)
median_str <- as.character(median)

# Add a vertical line for the median
abline(v = average, col = "orange", lwd = 2)
abline(v = median, col = "blue", lwd = 2)

legend("topright", legend = c(paste("Average",average_str), paste("Median",median_str)),
       col = c("orange", "blue"), lwd = 2)    
    
}
2.2 Training Data: Histogram of Max Cycle at Failure
We can see quite a bit of spread in failure times; in general, it’s uncommon to see failures below 100 cycles and very few engines last over 300 cycles.
create_histogram(train_max_cycles$max_cycles,main_title="Training Data: Max Cycle at Failure", xlab="Cycles",breaks = seq(min(train_max_cycles$max_cycles) - 30, max(train_max_cycles$max_cycles) + 30, by = 10) )

2.3 Validation Data: Histogram Max Cycles
Recall, the validation data is intentionally incomplete; we don’t yet know the max cycle at *failure*. Here, we observe the max cycles that we have been given in the dataset, which could be very early in the engine life or near the end.
We see that the max cycles span approximately the same range as the training dataset. Net, the cycles in our training data should represent the validation data well.
create_histogram(valid_max_cycles$max_cycles,main_title="Validation Data: Max Cycle Given (Not Necessarily a Failure)", xlab="Cycles", breaks = seq(min(valid_max_cycles$max_cycles) - 30, max(valid_max_cycles$max_cycles) + 30, by = 10))

2.4 In addition to sensor readings, we are given 3 “settings”.
We will remove “Setting 3” since it is constant for both Training and Validation data.
# Create a new plot with 1 row and 3 columns
par(mfrow = c(1, 3))
setting1_hist <- hist(train_rul_tbl$setting_1, main = "Setting 1", xlab = "")
setting2_hist <- hist(train_rul_tbl$setting_2, main = "Setting 2", xlab = "")
setting3_hist <- hist(train_rul_tbl$setting_3, main = "Setting 3", xlab = "")

Does each unit maintain a constant setting?
No: these settings are not “dialed in”, or static, for each unit. Here, we see the number of distinct settings for each unit. These vary by engine cycle, so they could provide valuable information.
train_rul_tbl %>% 
    select(unit_number, setting_1, setting_2, setting_3, count_me) %>% 
    distinct() %>% 
    group_by(unit_number) %>% 
    summarise(sum_distinct_settings = sum(count_me)) 
## # A tibble: 100 x 2
##    unit_number sum_distinct_settings
##          <dbl>                 <dbl>
##  1           1                   170
##  2           2                   242
##  3           3                   165
##  4           4                   166
##  5           5                   219
##  6           6                   174
##  7           7                   217
##  8           8                   133
##  9           9                   176
## 10          10                   193
## # ℹ 90 more rows
2.5 Histograms of Predictors
We can see low/no variation in the following sensors, so they will be removed when modeling as they won’t provide predictive power.
“sensor_1”, “sensor_5”, “sensor_6”, “sensor_10”, “sensor_16”, “sensor_18”, “sensor_19”
plot_hist_facet <- function(data, bins = 10, ncol = 5,
                            fct_reorder = FALSE, fct_rev = FALSE, 
                            fill = "gray", #palette_light()[[3]], 
                            color = "white", scale = "free") {
    
    data_factored <- data %>%
        mutate_if(is.character, as.factor) %>%
        mutate_if(is.factor, as.numeric) %>%
        gather(key = key, value = value, factor_key = TRUE) 
    
    if (fct_reorder) {
        data_factored <- data_factored %>%
            mutate(key = as.character(key) %>% as.factor())
    }
    
    if (fct_rev) {
        data_factored <- data_factored %>%
            mutate(key = fct_rev(key))
    }
    
    g <- data_factored %>%
        ggplot(aes(x = value, group = key)) +
        geom_histogram(bins = bins, fill = fill, color = color) +
        facet_wrap(~ key, ncol = ncol, scales = scale) + 
        theme_tq()
    
    return(g)
    
}
Remove sensor_1

Remove sensor_5

Remove sensor_10

Remove sensor_16, 18

Remove sensor_19

2.6 Skewness - do we need to do any transformations?
Sensor 6 has the largest skew; if you look at the histogram for this sensor, above, you can see that it is nearly a constant. Thus, we will be removing this variable anyway.
skewed_feature_names <- train_rul_tbl %>%
    select_if(is.numeric) %>%
    map_df(skewness) %>%
    gather(factor_key = T) %>%
    arrange(desc(abs(value))) %>% 
    kbl(caption = "Skew") %>%
    kable_classic(full_width = F, html_font = "Cambria")

skewed_feature_names
Skew
key value
sensor_6 -6.9163101
sensor_9 2.5551791
sensor_14 2.3723811
max_cycles 0.8887062
time_cycles 0.4998676
time_cycles_norm 0.4998676
rul 0.4998676
sensor_8 0.4793760
sensor_13 0.4697583
sensor_11 0.4692950
sensor_4 0.4431621
sensor_12 -0.4423751
sensor_7 -0.3943003
sensor_15 0.3882304
sensor_20 -0.3584191
sensor_17 0.3531000
sensor_21 -0.3503495
sensor_2 0.3165029
sensor_3 0.3089233
unit_number -0.0678103
setting_1 -0.0247645
setting_2 0.0090845
setting_3 NaN
sensor_1 NaN
sensor_5 NaN
sensor_10 NaN
sensor_16 NaN
sensor_18 NaN
sensor_19 NaN
count_me NaN
2.7 Line Charts
2.7.1 Line charts for 10 random units
The gray vertical line is at 30 RUL cycles.
Observations:
Settings 1 & 2 do not show an identifiable pattern and do not show any correlation to failure.
Sensors 9 and 14 look identical.
# Function to create line charts with moving average
create_line_chart <- function(data, column_names, descriptive_names, unit_numbers, window_size = 10) {
  # Filter the data based on unit_numbers
  filtered_data <- data %>%
    filter(unit_number %in% unit_numbers)
  
  # Loop through the columns
  for (i in 1:length(column_names)) {
    # Get the column name, descriptive name, and color
    column <- column_names[i]
    descriptive_name <- descriptive_names[i]
    
    # Create a new data frame for storing the moving averages
    moving_averages <- data.frame()
    
    # Loop through the unit_numbers
    for (unit in unit_numbers) {
      # Filter the data for the current unit_number
      unit_data <- filtered_data %>%
        filter(unit_number == unit)
      
      # Calculate moving average for the current unit
      # (rollmean() comes from the zoo package, attached via tidyquant)
      moving_average <- rollmean(unit_data[[column]], k = window_size, fill = NA, align = "right")
      
      # Create a new data frame for the current unit's moving average
      unit_moving_average <- data.frame(rul = unit_data$rul, moving_average, unit_number = unit)
      
      # Add the unit_moving_average to the overall moving_averages data frame
      moving_averages <- bind_rows(moving_averages, unit_moving_average)
    }
    
    # Create the line chart with moving averages
    p <- ggplot(moving_averages, aes(x = rul, y = moving_average, color = factor(unit_number))) +
      geom_line() +
       geom_vline(xintercept = 30, col = "gray", lwd = 1) +
      labs(title = c(paste(descriptive_name,"Window Size for Moving Avg:", as.character(window_size)) ), x = "RUL", y = descriptive_name) +
      theme_minimal() +
      scale_x_continuous(trans = "reverse")  # Reverse the x-axis ticks
    
    # Display the chart
    print(p)
  }
}
# Select random 10 units to show in line charts
set.seed(42)  # Set seed for reproducibility
sample_nums <- sample(1:100, 10)
unit_number_list <- as.character(sample_nums)

# columns to exclude from the line charts: identifiers, targets, and zero-variance sensors
no_var_colnames <- c("unit_number", "sensor_1", "sensor_5", "sensor_6", "sensor_10", "sensor_16", "sensor_18", "sensor_19", "setting_3", "rul", "count_me", "time_cycles", "max_cycles", "fail_in_30")

column_names <- colnames(train_rul_tbl)
cols_for_chart    <- column_names[!(column_names %in% no_var_colnames)]


# Call the function
create_line_chart(data = train_rul_tbl,
                  column_names = cols_for_chart,
                  descriptive_names = cols_for_chart,
                  unit_numbers = unit_number_list,
                  window_size = 10
                    )

2.7.2 What is going on with Sensors 9 & 14? They do not behave the same for all units. Let’s look at more data for sensor_9.
There is something unexplained going on here, possibly an interaction with another variable that is specific to certain units. This is a perfect case for subject matter experts to weigh in. For my modeling, I’m going to remove these sensors.
sample_nums <- 1:100
unit_number_list <- as.character(sample_nums)


cols_for_chart    <- "sensor_9"
create_line_chart(data = train_rul_tbl,
                  column_names = cols_for_chart,
                  descriptive_names = cols_for_chart,
                  unit_numbers = unit_number_list,
                  window_size = 10
                    )

2.8 We have class imbalance in our Training data for target variable: “fail_in_30”
Training data has a much higher proportion of non-failure records than failure records, as expected: only the last 30 cycles of each engine are labeled as failing, while an engine can run for hundreds of cycles. Although expected, this imbalance can make it more challenging to model failures.
train_rul_tbl <- train_rul_tbl %>% 
                            mutate(fail_in_30 = as.factor(case_when(rul <= 30  ~1,
                                        TRUE ~ 0)))


#train
pct<-round(table(train_rul_tbl$fail_in_30)/length(train_rul_tbl$fail_in_30)*100,1)
labs<-c("No-Fail-30", "Fail-30")
labs<-paste(labs,pct)
labs<-paste(labs, "%", sep="")
pie(table(train_rul_tbl$fail_in_30),labels=labs,col=c("gray", "navy"),main="Train: %Fail")

Validation data is imbalanced also, as expected. We will not make any adjustments to the validation data as this needs to remain pure.
valid_max_tbl <- valid_max_tbl %>%
                          mutate(fail_in_30 = as.factor(case_when(rul <= 30  ~1,
                                        TRUE ~ 0)))


pct<-round(table(valid_max_tbl$fail_in_30)/length(valid_max_tbl$fail_in_30)*100,1)
labs<-c("No-Fail-30", "Fail-30")
labs<-paste(labs,pct)
labs<-paste(labs, "%", sep="")
pie(table(valid_max_tbl$fail_in_30),labels=labs,col=c("gray", "navy"),main="Validation:%Fail")

2.8.1 Downsample the majority group to create balance - Training data
set.seed(42)

desired_minority_count <- round(nrow(train_rul_tbl) * 0.15)  # ~3095, roughly the number of failure rows

# Downsample the majority group to the same count
sampled_majority <- train_rul_tbl %>%
  filter(fail_in_30 == 0) %>%
  slice_sample(n = desired_minority_count, replace = FALSE)

# Sample the minority group
sampled_minority <- train_rul_tbl %>%
  filter(fail_in_30 == 1) %>%
  slice_sample(n = desired_minority_count, replace = FALSE)

# Combine the sampled majority and minority groups
train_sampled_tbl <- bind_rows(sampled_majority, sampled_minority)


#train
pct<-round(table(train_sampled_tbl$fail_in_30)/length(train_sampled_tbl$fail_in_30)*100,1)
labs<-c("No-Fail-30", "Fail-30")
labs<-paste(labs,pct)
labs<-paste(labs, "%", sep="")
pie(table(train_sampled_tbl$fail_in_30),labels=labs,col=c("gray", "navy"),main="Percent Fail in 30")

2.8.2 Shape of downsampled training data
print(paste("After downsampling, there are" , as.character(nrow(train_sampled_tbl)), "rows, and", as.character(ncol(train_sampled_tbl)), "columns in",deparse(substitute(train_sampled_tbl)) ))
## [1] "After downsampling, there are 6190 rows, and 31 columns in train_sampled_tbl"
summary(train_sampled_tbl$fail_in_30)
##    0    1 
## 3095 3095
3.0 Pre-processing, Scaling
Remove variables with zero variance, as they have no predictive power.
Scale variables to aid the algorithm optimization process - each predictor will have a mean of 0 and a standard deviation of 1.
# ?recipe
train_predictors <- train_sampled_tbl %>% 
                    select(-unit_number,-count_me,-max_cycles, -time_cycles, -rul)



recipe_scaled_obj <- recipe(fail_in_30 ~ . , data=train_predictors) %>% 
              step_nzv(all_predictors()) %>% 
              step_center(all_predictors()) %>% 
              step_scale( all_predictors()) %>% 
              prep()


train_scaled_tbl <- bake(recipe_scaled_obj, new_data = train_predictors)
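Under the hood, step_center() plus step_scale() amount to base R’s centering and scaling: subtract the column mean, divide by the column standard deviation. A tiny sketch with made-up numbers (not actual preprocessed output):

```r
# What step_center() + step_scale() do to one column, in base R terms.
# Values are illustrative sensor-like readings, not real output.
x <- c(641.82, 642.15, 642.35, 642.35, 642.37)
x_scaled <- (x - mean(x)) / sd(x)

round(mean(x_scaled), 10)  # 0
round(sd(x_scaled), 10)    # 1
```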
Check Pre-Processing for scaling
train_scaled_tbl %>% 
    select(setting_1, setting_2, sensor_2) %>% 
    plot_hist_facet(bins = 10, ncol = 3, fct_rev = F)

Pre-Processing for Validation Data
valid_sampled1_tbl <- valid_max_tbl %>% 
                    select(-unit_number,-count_me,-max_cycles, -time_cycles, -rul)


valid_scaled_tbl <- bake(recipe_scaled_obj, new_data = valid_sampled1_tbl)
4.0 Assess the binary classification (failure in 30 cycles)
Using boxplots of sensors vs. our binary response, we can see differentiation between our two groups (no fail = 0, fail = 1). This is exactly what we are looking for.
create_boxplots <- function(data, column_names, display_raw_data = FALSE) {
  
  # Convert "fail_in_30" to a factor if it's not already
  if (!is.factor(data$fail_in_30)) {
    data$fail_in_30 <- factor(data$fail_in_30)
  }

  # Loop through each column and create box plots
  for (column in column_names) {
  
    # Create a title based on the column name
    title <- paste("Box Plot -", column)

    # Create the box plot
    boxplot(data[[column]] ~ data$fail_in_30, main = title, xlab = "fail in 30")

    # Display raw data as points if specified
    if (display_raw_data) {
      points(data$fail_in_30, data[[column]], col = "lightgray", pch = 16)
    }
  }
}
# par(mfrow = c(3, 2))
cols_for_chart <- c( "sensor_4", "sensor_7", "sensor_11","sensor_12", "sensor_15")
create_boxplots(data = train_scaled_tbl,
                column_names = cols_for_chart,
                display_raw_data = FALSE
                )

5.0 Assess Outliers
Some algorithms are more sensitive to outliers than others. In our case, logistic regression is more sensitive than xgboost because of how the optimization is calculated.
We note that there are many outliers. Due to unfamiliarity with this subject matter, I’m leaving the outliers as-is. Outliers could influence logistic regression, but will have less of an effect on tree-based models like xgboost.
cooksd <- cooks.distance(glm(fail_in_30 ~ .,
                             family = "binomial",
                             data = train_scaled_tbl ))


outliers <- rownames(train_scaled_tbl[cooksd > 4*mean(cooksd, na.rm=T), ])
outliers_title <- length(outliers)



plot(cooksd,
     pch="*",
     cex=1,
     main= paste("Influential Points by Cook's distance (above red line), N =", outliers_title)
)
abline(h = 4*mean(cooksd, na.rm=T), col="red", lwd = 4)

6. Correlations (Linear)
6.1 “Correlation funnel”
This algorithm automatically bins continuous variables.
Each point is the correlation for that bin. We are interested in the points farthest to the right, because the “right-most” points relate most strongly to our target: “fail_in_30”.
We see a few strong correlations with failure - sensors 4, 7, 11, 12, 15.
# Correlation Funnel 
train_scaled_tbl %>%
    # binarize() puts continuous and categorical values into bins
    binarize() %>%
    # correlate fail_in_30 == 1 against all other variables
    correlate(fail_in_30__1) %>% 
    plot_correlation_funnel(interactive = TRUE, alpha = 0.7)
6.2 Correlations - another way
Once again, we see sensors 4,7,11,12,15 showing the strongest correlation.
# prepare data 
correlation_tbl <- train_scaled_tbl %>%
    rename_with(~ str_replace(., "^sensor_", "s"), starts_with("sensor_")) %>%
    rename_with(~ str_replace(., "^setting_", "set"), starts_with("setting_")) %>% 
    mutate(fail_in_30_num = as.numeric(fail_in_30)) %>% 
    select(-fail_in_30)


corr_matrix <- cor(correlation_tbl)
# Create a correlation plot with color shading
corrplot(corr_matrix, method = "color", type = "lower",
          tl.col = "black", tl.srt = 45, diag = FALSE)

6.3 Quantify the correlations >= 0.75
threshold <- 0.75

cor_table <- correlation_tbl %>%
  select_if(is.numeric) %>%
  cor() %>%
  as.data.frame() %>%
  rownames_to_column(var = "variable1") %>%
  gather(variable2, correlation, -variable1) %>%
  mutate(correlation = round(correlation, 2))

# Filter correlations above the threshold
cor_table %>%
  filter( (variable1 != variable2) & (variable1=="fail_in_30_num") & (abs(correlation) >= threshold) ) %>% 
  arrange(desc(abs(correlation))) %>% 
  kbl(caption = "Correlations >= 0.75") %>%
  kable_classic(full_width = F, html_font = "Cambria")
Correlations >= 0.75
variable1 variable2 correlation
fail_in_30_num s11 0.79
fail_in_30_num s4 0.78
fail_in_30_num s12 -0.77
fail_in_30_num s7 -0.75
fail_in_30_num s15 0.75
6.4 Examine correlations between predictors
Observe: sensor 9, 14 are almost perfectly correlated
threshold <- 0.9

# Filter correlations with predictors, above the threshold
filtered_cor_pred_table <- cor_table %>%
  filter((variable1 != variable2) & (variable1!="fail_in_30_num") & (abs(correlation) >= threshold))

filtered_cor_pred_table %>% 
    arrange(desc(abs(correlation)))  %>% 
    kbl(caption = "Correlations between Predictors >= 0.9") %>%
    kable_classic(full_width = F, html_font = "Cambria")
Correlations between Predictors >= 0.9
variable1 variable2 correlation
s14 s9 0.98
s9 s14 0.98
7.0 Final Processing
(1) Remove sensors 9 & 14, since they are highly correlated with each other and the line charts show their behavior is not consistent across units. (2) Remove settings 1 & 2, since they show no correlation with failure.
train_scaled_tbl <- train_scaled_tbl %>%  
                        select(-sensor_14, -sensor_9,-setting_1, -setting_2)


valid_scaled_tbl <- valid_scaled_tbl %>%  
                        select(-sensor_14, -sensor_9,-setting_1, -setting_2)
8. Modeling “Helpers”
8.1 Simple Decision Tree - what sensors are most important?
Decision Trees give us an intuitive way to learn about important predictors.
Once again, we see sensors 11 and 4 with the highest predictive power.
set.seed(42)
model_decision_tree <- decision_tree(mode = "classification",
              cost_complexity = 0.0001,
              tree_depth      = 4,
              min_n           = 10) %>%
    set_engine("rpart") %>%
    fit(as.factor(fail_in_30) ~ ., data = train_scaled_tbl)

# summary(model_decision_tree)

model_decision_tree$fit %>%
    rpart.plot(
        roundint = FALSE,
        type = 1,
        extra = 101,
        fallen.leaves = FALSE, #lines are angled
        cex = 0.6, #font size
        main = "Simple Decision Tree",
        box.palette = "Blues"
        )

8.2 Survival Curves

Survival analysis (also called time-to-event analysis or duration analysis) is a branch of statistics aimed at analyzing the expected duration of time until one or more events happen, called survival times or duration times.

In survival analysis, we are interested in a certain event and want to analyze the time until the event happens.

8.2.1 Kaplan Meier Survival
Univariate - the cycle number is the only predictor
Explanation of Mortality Table
“time” = cycle number
“estimate” = probability of survival. For the early cycles, the estimate is 1 (100% are running/surviving)
“n.risk” = number still under observation at this time
“n.event” = number that failed at this time/cycle
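The “estimate” column can be reproduced by hand: the Kaplan-Meier survival estimate is the running product of (1 - n.event / n.risk) over the event times. A tiny sketch with made-up counts (not this dataset’s mortality table):

```r
# Kaplan-Meier estimate by hand, with illustrative counts.
n_risk  <- c(10, 8, 6)   # units still under observation at each time
n_event <- c(2, 1, 3)    # failures at each time
estimate <- cumprod(1 - n_event / n_risk)
round(estimate, 3)
# 0.800 0.700 0.350
```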
survival_tbl <- train_sampled_tbl %>% 
                             mutate(fail_in_30 = as.numeric(case_when(fail_in_30 == 1 ~ 1, 
                                                              TRUE ~ 0
                                                              ))) %>% 
                             select(time_cycles, fail_in_30, sensor_11, sensor_4, sensor_7, sensor_12 ) 

surv_simple_km <- survfit(Surv(time_cycles, fail_in_30) ~ 1,data = survival_tbl)
tidy(surv_simple_km) %>% 
    head(20) %>% 
    kbl(caption = "Kaplan Meier Mortality Table, Top 20") %>%
    kable_classic(full_width = F, html_font = "Cambria") 
Kaplan Meier Mortality Table, Top 20
time n.risk n.event n.censor estimate std.error conf.high conf.low
1 6190 0 20 1 0 1 1
2 6170 0 21 1 0 1 1
3 6149 0 16 1 0 1 1
4 6133 0 12 1 0 1 1
5 6121 0 17 1 0 1 1
6 6104 0 15 1 0 1 1
7 6089 0 12 1 0 1 1
8 6077 0 20 1 0 1 1
9 6057 0 24 1 0 1 1
10 6033 0 21 1 0 1 1
11 6012 0 15 1 0 1 1
12 5997 0 21 1 0 1 1
13 5976 0 19 1 0 1 1
14 5957 0 21 1 0 1 1
15 5936 0 20 1 0 1 1
16 5916 0 19 1 0 1 1
17 5897 0 21 1 0 1 1
18 5876 0 14 1 0 1 1
19 5862 0 14 1 0 1 1
20 5848 0 18 1 0 1 1
8.2.2 Cox Proportional Hazards Model
This is an enhancement over Kaplan Meier in that it is multivariate, and it is a predictive model rather than purely descriptive.
sensor_11 turns out to be the only significant predictor (p < 0.05).
model_coxph <- coxph(Surv(time_cycles, fail_in_30) ~ . , data = survival_tbl)

# Regression Estimates
tidy(model_coxph)
## # A tibble: 4 x 5
##   term      estimate std.error statistic p.value
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>
## 1 sensor_11  0.305     0.149       2.04   0.0413
## 2 sensor_4  -0.00522   0.00391    -1.34   0.182 
## 3 sensor_7  -0.0142    0.0404     -0.351  0.725 
## 4 sensor_12 -0.0482    0.0499     -0.967  0.333
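The estimates above are on the log hazard scale; exponentiating gives hazard ratios directly. A quick sketch using the sensor_11 estimate from the table:

```r
# exp(coefficient) = hazard ratio: the multiplicative change in the
# instantaneous failure risk per one-unit increase in the predictor
hr_sensor_11 <- exp(0.305)
round(hr_sensor_11, 2)
# 1.36 -> a one-unit rise in sensor_11 is associated with
# roughly a 36% higher hazard of failing within 30 cycles
```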
Cox Proportional Hazards Mortality Table
# Mortality Table
model_coxph %>%
    survfit() %>%
    tidy() %>% 
    head(20) %>% 
    kbl(caption = "Cox PH Mortality Table, Top 20") %>%
    kable_classic(full_width = F, html_font = "Cambria")
Cox PH Mortality Table, Top 20
time n.risk n.event n.censor estimate std.error conf.high conf.low
1 6190 0 20 1 0 1 1
2 6170 0 21 1 0 1 1
3 6149 0 16 1 0 1 1
4 6133 0 12 1 0 1 1
5 6121 0 17 1 0 1 1
6 6104 0 15 1 0 1 1
7 6089 0 12 1 0 1 1
8 6077 0 20 1 0 1 1
9 6057 0 24 1 0 1 1
10 6033 0 21 1 0 1 1
11 6012 0 15 1 0 1 1
12 5997 0 21 1 0 1 1
13 5976 0 19 1 0 1 1
14 5957 0 21 1 0 1 1
15 5936 0 20 1 0 1 1
16 5916 0 19 1 0 1 1
17 5897 0 21 1 0 1 1
18 5876 0 14 1 0 1 1
19 5862 0 14 1 0 1 1
20 5848 0 18 1 0 1 1
8.2.3 Survival Curves plot function
Both curves are similar; the additional covariates used in the Cox model did not appreciably change the curve.
The steepest part of the curves falls between 150 and 175 cycles.
plot_customer_survival <- function(object_survfit,title) {
    
        tidy(object_survfit) %>%
        ggplot(aes(time, estimate)) +
        geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.5) +
        geom_line(size = 1) +
        theme_tq() +
        scale_color_tq() +
        scale_y_continuous(labels = scales::percent_format()) +
        labs(title = title, 
             x = "Cycles", y = "%Running")
    
}
survfit_coxph <- survfit(model_coxph)

# Note: par(mfrow) only affects base graphics; these are ggplot objects,
# so the two plots print separately
plot_customer_survival(surv_simple_km, "Survival Curve by Cycles: Kaplan Meier")

plot_customer_survival(survfit_coxph, "Survival Curve by Cycles: Cox Ph")

9.0 Modeling
9.1 Create a Confusion Matrix function
This function will help us establish a probability cutoff value based on assigning a cost to false negatives and positives.
calculate_classification_metrics <- function(df, pred, actual, cutoff = 0) {
  # cutoff = 0 will trigger a for-loop to evaluate many thresholds.
  # cutoff > 0 means we want to evaluate a specific cutoff value
    
  # cost for false positives and negatives
  fn_cost <- 20
  fp_cost <- 0.5

  # Set the threshold values
  if (cutoff > 0) {
      thresh <- cutoff
  } else {
      thresh <- seq(0.005, 0.5, 0.005)
  }
  
  # Create an empty dataframe to store the results
  results_df <- data.frame(threshold = numeric(length(thresh)),
                           accuracy = numeric(length(thresh)),
                           error_rate = numeric(length(thresh)),
                           sensitivity = numeric(length(thresh)),
                           specificity = numeric(length(thresh)),
                           false_positive_rate = numeric(length(thresh)),
                           false_negative_rate = numeric(length(thresh)),
                           true_pos = numeric(length(thresh)),
                           true_neg = numeric(length(thresh)),
                           false_pos = numeric(length(thresh)),
                           false_neg = numeric(length(thresh)),
                           cost = numeric(length(thresh))
                           )
  
  # Iterate over each threshold value
  for (i in seq_along(thresh)) {
    
      # Create a binary prediction based on the threshold
    df$predicted <- ifelse(df[[pred]] >= thresh[i], 1, 0)
    
    df[[actual]] <- as.factor(df[[actual]])
    df$predicted <- as.factor(df$predicted)

    # If predictions collapse to a single class at this threshold,
    # the confusion matrix would not be 2x2, so skip it
    if (length(levels(df[[actual]])) != length(levels(df$predicted)))
        next 
    
    # Create the confusion matrix.
    # Note the argument order: caret's confusionMatrix(data, reference) treats
    # the first argument as predictions, so passing the actuals first makes the
    # table's rows the actual class and its columns the predicted class;
    # the cell indexing below relies on this.
    cm <- confusionMatrix(df[[actual]], df$predicted)

    # Store the metrics in the results dataframe
    results_df[i, "threshold"]   <- thresh[i]
    results_df[i, "accuracy"]    <- cm$overall['Accuracy']
    results_df[i, "error_rate"]  <- 1 - cm$overall['Accuracy']

    # Extract cell counts; given the argument order above,
    # the first index is the actual class, the second the predicted class
    results_df[i, "true_pos"]  <- cm$table[2, 2]
    results_df[i, "true_neg"]  <- cm$table[1, 1]
    results_df[i, "false_pos"] <- cm$table[1, 2]
    results_df[i, "false_neg"] <- cm$table[2, 1]
    
    # Calculate Sens TP / (TP + FN), Spec TN / (TN + FP)
    results_df[i, "sensitivity"] <- results_df[i, "true_pos"] / (results_df[i, "true_pos"] + results_df[i, "false_neg"])
      
    results_df[i, "specificity"] <- results_df[i, "true_neg"] / (results_df[i, "true_neg"] + results_df[i, "false_pos"])
      
    # Calculate false positive rate and false negative rate
    results_df[i, "false_positive_rate"] <- results_df[i, "false_pos"] / (results_df[i, "false_pos"] + results_df[i, "true_neg"])
    results_df[i, "false_negative_rate"] <- results_df[i, "false_neg"] / (results_df[i, "false_neg"] + results_df[i, "true_pos"])

    # Calculate cost
    results_df[i, "cost"] <- results_df[i, "false_pos"]*fp_cost + results_df[i, "false_neg"]*fn_cost
    

    }
    
    # Return the results dataframe
    return(results_df)
  
}
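A quick check of the cost logic, plugging in the false positive and false negative counts that the logistic regression training run produces later in the document:

```r
# cost = FP * fp_cost + FN * fn_cost; a missed failure costs 40x a false alarm
fn_cost <- 20
fp_cost <- 0.5

false_pos <- 869  # counts from the logistic regression training results table
false_neg <- 4

cost <- false_pos * fp_cost + false_neg * fn_cost
cost
# 869 * 0.5 + 4 * 20 = 514.5, matching the cost row in that table
```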
9.2 Logistic Regression
The first attempt showed that time_cycles_norm was not significant, so it was removed.
All remaining predictors are significant at alpha = 0.05.
Since we scaled our data, we can compare the coefficient estimates directly for impact on failure.
Sensors 11 and 4 have the biggest predictive power.
set.seed(42)

# model_log_train<-glm( as.factor(fail_in_30) ~ . ,  family = binomial, data = train_scaled_tbl )
# Remove time_cycles_norm -> insignificant

model_log_train<-glm( as.factor(fail_in_30) ~ . ,  family = binomial, data = train_scaled_tbl %>% select(-time_cycles_norm) )

summary(model_log_train)
## 
## Call:
## glm(formula = as.factor(fail_in_30) ~ ., family = binomial, data = train_scaled_tbl %>% 
##     select(-time_cycles_norm))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.99226  -0.10307   0.00216   0.17016   2.82208  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.51340    0.06107  -8.407  < 2e-16 ***
## sensor_2     0.49252    0.10274   4.794 1.64e-06 ***
## sensor_3     0.41981    0.09382   4.474 7.66e-06 ***
## sensor_4     1.05486    0.14015   7.527 5.20e-14 ***
## sensor_7    -0.47980    0.13671  -3.510 0.000449 ***
## sensor_8    -0.36853    0.12702  -2.901 0.003715 ** 
## sensor_11    1.31905    0.16126   8.179 2.85e-16 ***
## sensor_12   -0.73500    0.15430  -4.763 1.90e-06 ***
## sensor_13   -0.37087    0.12491  -2.969 0.002988 ** 
## sensor_15    0.74934    0.11936   6.278 3.43e-10 ***
## sensor_17    0.61021    0.09689   6.298 3.01e-10 ***
## sensor_20   -0.45538    0.11254  -4.046 5.20e-05 ***
## sensor_21   -0.53256    0.11258  -4.731 2.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8581.2  on 6189  degrees of freedom
## Residual deviance: 2065.4  on 6177  degrees of freedom
## AIC: 2091.4
## 
## Number of Fisher Scoring iterations: 7
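Because the predictors are standardized, exponentiating a coefficient gives the odds multiplier per one standard deviation. A small sketch using the two largest coefficients from the summary above:

```r
# exp(beta) = multiplicative change in the odds of failing within 30 cycles
# per one standard deviation increase in the (scaled) sensor reading
round(exp(c(sensor_11 = 1.31905, sensor_4 = 1.05486)), 2)
# sensor_11 3.74, sensor_4 2.87: each SD increase in sensor_11
# roughly quadruples the odds of imminent failure
```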
9.2.1 Log Regression Training Results - find optimal cutoff value based on cost
rslt_log_train <- train_scaled_tbl %>% select(fail_in_30)
model_log_train_predicted <- predict( model_log_train,type ="response") # converts to prob
rslt_log_train_tbl <- data.frame(model_log_train_predicted,rslt_log_train )
colnames(rslt_log_train_tbl) <- c("probability", "actual")

rslt_log_train_metrics_tbl <- calculate_classification_metrics(df=rslt_log_train_tbl, pred="probability", actual="actual", cutoff = 0)

rslt_log_train_metrics_tbl <- rslt_log_train_metrics_tbl %>%  arrange(cost)

cutoff <- rslt_log_train_metrics_tbl$threshold[1]

rslt_log_train_optimal_metrics_tbl <- head(rslt_log_train_metrics_tbl,1)

rslt_log_train_optimal_metrics_tbl <- pivot_longer(rslt_log_train_optimal_metrics_tbl, everything(), names_to = "Metric", values_to = "Value")

rslt_log_train_optimal_metrics_tbl <- rslt_log_train_optimal_metrics_tbl %>%  
    mutate(Value = round(Value,3))

rslt_log_train_optimal_metrics_tbl %>% 
       kbl(caption = "Logistic Regression: Training data results") %>%
       kable_classic(full_width = F, html_font = "Cambria")
Logistic Regression: Training data results
Metric Value
threshold 0.045
accuracy 0.859
error_rate 0.141
sensitivity 0.999
specificity 0.719
false_positive_rate 0.281
false_negative_rate 0.001
true_pos 3091.000
true_neg 2226.000
false_pos 869.000
false_neg 4.000
cost 514.500
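A decision-theoretic sanity check on that 0.045 cutoff: with these costs, flagging an engine pays off whenever the expected false negative cost exceeds the expected false positive cost. This break-even probability is a simplification (it treats each record independently), but it lands in the same neighborhood as the grid-search result:

```r
# Flag when p * fn_cost > (1 - p) * fp_cost,
# i.e. when p > fp_cost / (fp_cost + fn_cost)
fn_cost <- 20
fp_cost <- 0.5
break_even_p <- fp_cost / (fp_cost + fn_cost)
round(break_even_p, 4)
# 0.0244: the theoretical break-even probability,
# close to the empirically chosen cutoff of 0.045
```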
9.2.2 Validation
Apply the model with the cutoff value found above to the validation set
rslt_log_valid <- valid_scaled_tbl %>% select(fail_in_30)
model_log_valid_predicted <- predict(model_log_train,  newdata = valid_scaled_tbl, type = "response"  )

rslt_log_valid_tbl <- data.frame(model_log_valid_predicted, rslt_log_valid  )
colnames(rslt_log_valid_tbl) <- c("probability", "actual")

rslt_log_valid_metrics_tbl <- calculate_classification_metrics(df=rslt_log_valid_tbl, pred="probability", actual="actual", cutoff = cutoff)

rslt_log_valid_metrics_tbl <- pivot_longer(rslt_log_valid_metrics_tbl, everything(), names_to = "Metric", values_to = "Value")

rslt_log_valid_metrics_tbl <- rslt_log_valid_metrics_tbl %>%  
    mutate(Value = round(Value,3))

rslt_log_valid_metrics_tbl %>% 
       kbl(caption = "Logistic Regression: Validation data results") %>%
       kable_classic(full_width = F, html_font = "Cambria")
Logistic Regression: Validation data results
Metric Value
threshold 0.045
accuracy 0.884
error_rate 0.116
sensitivity 0.988
specificity 0.881
false_positive_rate 0.119
false_negative_rate 0.012
true_pos 328.000
true_neg 11243.000
false_pos 1521.000
false_neg 4.000
cost 840.500
9.2.3 Logistic ROC for Validation data
# ROC
valid_ROCprediction<-prediction(model_log_valid_predicted, rslt_log_valid_tbl$actual)  
valid_ROCperformance<-performance(valid_ROCprediction, "tpr", "fpr") 
plot(valid_ROCperformance, colorize=TRUE)

9.2.4 Logistic AUC for Validation data
#AUC
valid_AUC <- unlist(slot(performance(valid_ROCprediction, "auc"), "y.values"))
print(paste("The AUC for validation is:", valid_AUC))
## [1] "The AUC for validation is: 0.985308359731625"
9.2.5 Sanity Check: Let’s filter the original datasets for just the failures, and look at the predictions.
For the validation set, the predicted probabilities spanned a wider range than in the training dataset.
# par(mfrow =c(2,1))
train_scaled_fail_predict <- cbind(train_scaled_tbl,model_log_train_predicted) %>% 
                             select(fail_in_30, model_log_train_predicted) %>% 
                             filter(fail_in_30 ==1)%>% 
                             arrange(model_log_train_predicted)

hist(train_scaled_fail_predict$model_log_train_predicted, main = "Train Predicted Prob for Actual Failures", xlab = "", breaks=30)

valid_scaled_fail_predict <- cbind(valid_scaled_tbl,model_log_valid_predicted) %>% 
                             select(fail_in_30, model_log_valid_predicted) %>% 
                             filter(fail_in_30 == 1) %>% 
                             arrange(model_log_valid_predicted)

hist(valid_scaled_fail_predict$model_log_valid_predicted, main = "Validation Predicted Prob for Actual Failures", xlab = "", breaks=30)

10.0 XGBoost
10.1 Prepare predictors and target variables
boost_train_scaled_tbl  <- train_scaled_tbl %>% 
                           mutate(fail_in_30 = as.integer(case_when(fail_in_30 == 1 ~ 1,
                                                         TRUE ~ 0)))

boost_valid_scaled_tbl  <- valid_scaled_tbl %>% 
                           mutate(fail_in_30 = as.integer(case_when(fail_in_30 == 1 ~ 1,
                                                         TRUE ~ 0)))

y_train <- as.integer(boost_train_scaled_tbl$fail_in_30) 
y_test  <- as.integer(boost_valid_scaled_tbl$fail_in_30) 
X_train <- boost_train_scaled_tbl %>% select(-fail_in_30)
X_test  <- boost_valid_scaled_tbl %>% select(-fail_in_30)
10.2 Create xgb DMatrix and params
xgb_train <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
xgb_test  <- xgb.DMatrix(data = as.matrix(X_test),  label = y_test)
params <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  max.depth = 3,            # maximum depth of each tree (default 6)
  early_stopping_rounds = 3 # stop if no improvement in this many rounds;
                            # note this belongs in the training call, not params,
                            # which is what triggers the warning below
)
10.3 XGB Model
set.seed(42)
nrounds <- 10  # Set the number of boosting rounds
xgb_model <- xgboost(data = xgb_train, params = params, nrounds = nrounds)
## [20:08:29] WARNING: amalgamation/../src/learner.cc:627: 
## Parameters: { "early_stopping_rounds" } might not be used.
## 
##   This could be a false alarm, with some parameters getting used by language bindings but
##   then being mistakenly passed down to XGBoost core, or some parameter actually being used
##   but getting flagged wrongly here. Please open an issue if you find any such cases.
## 
## 
## [1]  train-error:0.082714 
## [2]  train-error:0.079321 
## [3]  train-error:0.068982 
## [4]  train-error:0.067690 
## [5]  train-error:0.065913 
## [6]  train-error:0.064943 
## [7]  train-error:0.064459 
## [8]  train-error:0.065267 
## [9]  train-error:0.063005 
## [10] train-error:0.063005
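The warning above appears because early_stopping_rounds sits inside params; in the R xgboost API it is an argument to the training call and needs an evaluation set to monitor. A self-contained sketch of the intended usage on synthetic stand-in data (the real DMatrix objects are built earlier):

```r
library(xgboost)
set.seed(42)

# Synthetic data, just to demonstrate the API
X <- matrix(rnorm(200 * 4), ncol = 4)
y <- as.integer(X[, 1] + rnorm(200) > 0)
dtrain <- xgb.DMatrix(data = X[1:150, ], label = y[1:150])
deval  <- xgb.DMatrix(data = X[151:200, ], label = y[151:200])

# early_stopping_rounds goes to xgb.train() itself, with a watchlist
# naming the evaluation set that is monitored for improvement
model <- xgb.train(
  params  = list(objective = "binary:logistic", eval_metric = "error", max_depth = 3),
  data    = dtrain,
  nrounds = 50,
  watchlist = list(eval = deval),
  early_stopping_rounds = 3,
  verbose = 0
)
model$best_iteration  # the round with the best eval error
```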

10.4 Feature Importance - once again, sensors 11 and 4 are the strongest

# get information on how important each feature is
importance_matrix <- xgb.importance(names(X_train), model = xgb_model)
xgb.plot.importance(importance_matrix)

10.5 XGBoost - find optimal prediction cutoff value based on cost

# generate predictions for training
xgb_pred_train <- predict(xgb_model, xgb_train)

# create df with predictions and the actual values
xgb_train_tbl <- data.frame(xgb_pred_train,rslt_log_train)
colnames(xgb_train_tbl) <- c("probability", "actual")

# get cost based on different thresholds
rslt_xgb_train_metrics_tbl <- calculate_classification_metrics(df=xgb_train_tbl, pred="probability", actual="actual", cutoff = 0)


rslt_xgb_train_metrics_tbl <- rslt_xgb_train_metrics_tbl %>%  
                              # drop rows the function skipped (threshold left at its
                              # default 0 when predictions collapsed to one class)
                              filter(threshold > 0) %>%  
                              arrange(cost)

xgb_cutoff <- rslt_xgb_train_metrics_tbl$threshold[1]

rslt_xgb_train_optimal_metrics_tbl <- head(rslt_xgb_train_metrics_tbl,1)

rslt_xgb_train_optimal_metrics_tbl <- pivot_longer(rslt_xgb_train_optimal_metrics_tbl, everything(), names_to = "Metric", values_to = "Value")

rslt_xgb_train_optimal_metrics_tbl <- rslt_xgb_train_optimal_metrics_tbl %>%  
    mutate(Value = round(Value,3))

rslt_xgb_train_optimal_metrics_tbl %>% 
    kbl(caption = "xgboost: Training data results") %>%
    kable_classic(full_width = F, html_font = "Cambria")
xgboost: Training data results
Metric Value
threshold 0.110
accuracy 0.892
error_rate 0.108
sensitivity 0.999
specificity 0.784
false_positive_rate 0.216
false_negative_rate 0.001
true_pos 3093.000
true_neg 2428.000
false_pos 667.000
false_neg 2.000
cost 373.500

10.6 XGBoost - Validation

# generate predictions
xgb_pred_test <- predict(xgb_model, xgb_test)

# create df with predictions and the actual values
xgb_test_tbl <- data.frame(xgb_pred_test,rslt_log_valid)
colnames(xgb_test_tbl) <- c("probability", "actual")

# get cost based on different thresholds
rslt_xgb_test_metrics_tbl <- calculate_classification_metrics(df=xgb_test_tbl, pred="probability", actual="actual", cutoff = xgb_cutoff)

rslt_xgb_test_metrics_tbl <- pivot_longer(rslt_xgb_test_metrics_tbl, everything(), names_to = "Metric", values_to = "Value")

rslt_xgb_test_metrics_tbl <- rslt_xgb_test_metrics_tbl %>%  
    mutate(Value = round(Value,3))

rslt_xgb_test_metrics_tbl %>% 
    kbl(caption = "xgboost: Validation data results") %>%
    kable_classic(full_width = F, html_font = "Cambria")
xgboost: Validation data results
Metric Value
threshold 0.110
accuracy 0.917
error_rate 0.083
sensitivity 0.970
specificity 0.916
false_positive_rate 0.084
false_negative_rate 0.030
true_pos 322.000
true_neg 11692.000
false_pos 1072.000
false_neg 10.000
cost 736.000
11.0 What is the business value in using machine learning for predictions?
Let’s compare the machine learning results to a baseline: assume we predict no failures in the validation dataset. The implication is that we run every engine to failure and never shut one down in advance for repair. With no predictions, every actual failure is a false negative, so the baseline cost is simply the number of failures multiplied by the false negative cost. Then we’ll compare that cost to our models.
  • Baseline cost: 6640
  • Logistic regression cost: 840.5
  • xgboost cost: 736

Both machine learning models provide significant value.

fn_cost <- 20

valid_fail <- valid_scaled_tbl %>% 
     select(fail_in_30) %>% 
     mutate( fail_in_30_num = as.numeric(case_when (fail_in_30 == 1 ~ 1,
                                        TRUE ~0))
     )

valid_naiive_cost <- sum(valid_fail$fail_in_30_num) * fn_cost 
valid_logistic_cost <- rslt_log_valid_metrics_tbl %>% 
                       filter(Metric == 'cost') %>% 
                       select(Value) 
valid_logistic_cost <- valid_logistic_cost$Value[1]

valid_xgb_cost <- rslt_xgb_test_metrics_tbl %>% 
                       filter(Metric == 'cost') %>% 
                       select(Value) 
valid_xgb_cost <- valid_xgb_cost$Value[1]

paste("The naiive approach cost is:", valid_naiive_cost, "the cost of the logistic regression was",valid_logistic_cost, "the xgboost cost was",valid_xgb_cost )
## [1] "The naiive approach cost is: 6640 the cost of the logistic regression was 840.5 the xgboost cost was 736"
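Expressed as savings relative to the run-to-failure baseline:

```r
# Percent cost reduction vs the no-prediction baseline
baseline    <- 6640
model_costs <- c(logistic = 840.5, xgboost = 736)

savings_pct <- round(100 * (1 - model_costs / baseline), 1)
savings_pct
# logistic ~87.3%, xgboost ~88.9% cheaper than running to failure
```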